Basketball was invented by Canadian physical education instructor James Naismith in 1891. Over time, the rules have kept changing and the sport’s popularity has grown enormously. Today, basketball is one of the most popular sports in the world, and the NBA represents its highest level. The NBA has seen many all-time great players, such as Bill Russell, Wilt Chamberlain, Magic Johnson, Larry Bird, Michael Jordan, Hakeem Olajuwon, Shaquille O’Neal, Allen Iverson, Kobe Bryant, and LeBron James. Today, however, the NBA is changing and focusing more on three-point shooting.
In the last six seasons, the Golden State Warriors won three championships and reached five Finals, which arguably makes them the most dominant team in the NBA over that span. A big reason for their rise is the “deadly” three-point shooting of the “Splash Brothers”, Stephen Curry and Klay Thompson. But if you had been watching the NBA in 2000, you would not have believed that three-point shooting would become so important. At that time, the NBA was dominated by great centers like Shaquille O’Neal.
The offensive style of today’s NBA has changed a lot. Back in 1999, the Spurs used 88.6 possessions per 48 minutes according to Basketball-Reference.com. In 2017, the Golden State Warriors used 102.24 possessions per 48 minutes. Both teams won the title in those respective years. A faster pace means more points scored across the league, and the three-point shot has a lot to do with that.
Gregg Popovich, one of the greatest coaches of all time, said: “Everything is about understanding it’s about the rules of the league and what you have to do to win. And these days what’s changed it is that everybody can shoot threes.”
As noted in the introduction, NBA offenses and defenses have changed a lot: teams play faster and shoot more threes. It can be said that the NBA has entered the era of three-point shooting. Our team is interested in how the data reflect this change.
To investigate, we first tried to scrape the data from the official NBA website, but the site appears to block unauthorized users from collecting its data. We then searched for the best alternative source of NBA data. After some comparison, we decided to scrape the data from https://www.basketball-reference.com/leagues/NBA_2020.html#all_team-stats-base, using the Miscellaneous Stats table. We will analyze the relationship between winning percentage and different attributes, such as three-point attempt rate. We also want to see how different categories, such as pace, changed from 2000 to 2019.
Every column has an abbreviated name, so we provide a glossary.
Age – Player’s age on February 1 of the season
W – Wins
L – Losses
PW – Pythagorean wins, i.e., expected wins based on points scored and allowed
PL – Pythagorean losses, i.e., expected losses based on points scored and allowed
MOV – Margin of Victory
SOS – Strength of Schedule; a rating of strength of schedule. The rating is denominated in points above/below average, where zero is average.
SRS – Simple Rating System; a team rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average.
ORtg – Offensive Rating. An estimate of points produced (players) or scored (teams) per 100 possessions
DRtg – Defensive Rating. An estimate of points allowed per 100 possessions
NRtg – Net Rating; an estimate of point differential per 100 possessions.
Pace – Pace Factor: An estimate of possessions per 48 minutes
FTr – Free Throw Attempt Rate. Number of FT attempts per FG attempt
X3PAr or 3PFGAR – 3-Point Attempt Rate. Percentage of FG attempts from 3-point range
TS. – True Shooting Percentage. A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.
Offense Four Factors
eFG. – Effective Field Goal Percentage. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.
TOV. – Turnover Percentage. An estimate of turnovers committed per 100 plays.
ORB. – Offensive Rebound Percentage. An estimate of the percentage of available offensive rebounds a player grabbed while he was on the floor.
FT.FGA (FT/FGA) – Free Throws Per Field Goal Attempt.
Defense Four Factors
DRB. – Defensive Rebound Percentage. An estimate of the percentage of available defensive rebounds a player grabbed while he was on the floor.
DRB – Defensive Rebounds
ORB – Offensive Rebounds
TRB – Total Rebounds
AST – Assists
G – Games
MP – Minutes Played
FG – Field Goals
FGA – Field Goal Attempts
FG. – Field Goal Percentage
X3P or 3PFG – 3-Point Field Goals
X3PA or 3PFGA – 3-Point Field Goal Attempts
X3P. or 3PFGP – 3-Point Field Goal Percentage
X2P or 2PFG – 2-Point Field Goals
X2PA or 2PFGA – 2-Point Field Goal Attempts
X2P. or 2PFGP – 2-Point Field Goal Percentage
Attend. – Attendance
WP – Winning Percentage
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
We used Python to scrape the data from the website. To see how we scraped and cleaned it, please go to Part 4 - Python Part.
path = "/Users/zhaoyizhuang/Downloads/cs320final-master/nba_data.csv"
data <- read.csv(path)
data %>% head()
## yearID Team Age W L PW PL MOV SOS SRS ORtg DRtg
## 1 2000 Los Angeles Lakers 29.2 67 15 64 18 8.55 -0.14 8.41 107.3 98.2
## 2 2000 Portland Trail Blazers 29.6 59 23 59 23 6.40 -0.04 6.36 107.9 100.8
## 3 2000 San Antonio Spurs 30.9 53 29 58 24 5.94 -0.02 5.92 105.0 98.6
## 4 2000 Phoenix Suns 28.6 53 29 56 26 5.22 0.02 5.24 104.6 99.0
## 5 2000 Utah Jazz 31.5 55 27 54 28 4.46 0.05 4.52 107.3 102.3
## 6 2000 Indiana Pacers 30.4 56 26 54 28 4.60 -0.45 4.15 108.5 103.6
## NRtg Pace FTr X3PAr TS. eFG. TOV. ORB. FT.FGA eFG..1 TOV..1 DRB.
## 1 9.1 93.3 0.346 0.153 0.525 0.484 12.7 30.6 0.241 0.443 13.4 73.1
## 2 7.1 89.9 0.316 0.175 0.546 0.501 14.5 30.3 0.240 0.461 13.8 72.4
## 3 6.4 90.8 0.346 0.138 0.535 0.488 14.3 27.8 0.258 0.451 13.5 73.0
## 4 5.6 94.0 0.286 0.184 0.532 0.491 15.2 29.3 0.217 0.454 15.7 70.5
## 5 5.0 89.6 0.337 0.134 0.540 0.490 14.3 29.5 0.260 0.477 15.0 73.2
## 6 4.9 93.1 0.302 0.224 0.552 0.503 13.3 24.9 0.245 0.469 12.6 71.5
## FT.FGA.1 Attend. G MP FG FGA FG. X3P X3PA X3P. X2P X2PA X2P. FT
## 1 0.222 771420 82 19805 3276 7288 0.450 534 1656 0.322 2742 5632 0.487 1521
## 2 0.217 835078 82 19830 3044 6635 0.459 439 1223 0.359 2605 5412 0.481 1956
## 3 0.188 884450 82 19730 3195 7047 0.453 519 1326 0.391 2676 5721 0.468 1407
## 4 0.245 773115 82 19730 3047 6640 0.459 583 1487 0.392 2464 5153 0.478 1629
## 5 0.256 801268 82 19855 3174 6827 0.465 394 1069 0.369 2780 5758 0.483 1558
## 6 0.197 752145 82 19805 3137 6836 0.459 344 1047 0.329 2793 5789 0.482 1649
## FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS
## 1 2016 0.754 1056 2635 3691 1953 787 381 1325 1729 8607
## 2 2506 0.781 917 2458 3375 1707 665 273 1288 2011 8483
## 3 1751 0.804 931 2444 3375 1810 592 416 1124 1770 8316
## 4 2008 0.811 842 2612 3454 1857 559 422 1159 1786 8306
## 5 1982 0.786 1016 2373 3389 1852 671 381 1230 2020 8300
## 6 2368 0.696 1117 2738 3855 1921 613 534 1143 1841 8267
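As a quick check on the glossary, the rating columns in the rows printed above can be verified by hand: net rating should equal offensive rating minus defensive rating. The sketch below retypes the six printed rows so that it does not depend on the CSV file; the data frame name `ratings` and the column `NRtg_check` are introduced here only for illustration.

```r
# ORtg, DRtg and NRtg copied from the six rows printed by head() above
ratings <- data.frame(
  Team = c("Los Angeles Lakers", "Portland Trail Blazers", "San Antonio Spurs",
           "Phoenix Suns", "Utah Jazz", "Indiana Pacers"),
  ORtg = c(107.3, 107.9, 105.0, 104.6, 107.3, 108.5),
  DRtg = c(98.2, 100.8, 98.6, 99.0, 102.3, 103.6),
  NRtg = c(9.1, 7.1, 6.4, 5.6, 5.0, 4.9)
)

# Net rating recomputed from its definition: points scored minus points
# allowed, both per 100 possessions
ratings$NRtg_check <- round(ratings$ORtg - ratings$DRtg, 1)

print(ratings)
```

For these six rows the recomputed values agree with the NRtg column to one decimal place.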
To find the relationship between those stats and the winning percentage, we first add a column containing the winning percentage, computed as W/(W+L). We also add a new column called year that groups the seasons into five equal-width time periods (four seasons each).
data <- data %>% mutate(WP = W/(W+L))
#cut year into 5 intervals
data <- data %>%
mutate(year = cut(yearID, breaks = 5))
data %>% head()
## yearID Team Age W L PW PL MOV SOS SRS ORtg DRtg
## 1 2000 Los Angeles Lakers 29.2 67 15 64 18 8.55 -0.14 8.41 107.3 98.2
## 2 2000 Portland Trail Blazers 29.6 59 23 59 23 6.40 -0.04 6.36 107.9 100.8
## 3 2000 San Antonio Spurs 30.9 53 29 58 24 5.94 -0.02 5.92 105.0 98.6
## 4 2000 Phoenix Suns 28.6 53 29 56 26 5.22 0.02 5.24 104.6 99.0
## 5 2000 Utah Jazz 31.5 55 27 54 28 4.46 0.05 4.52 107.3 102.3
## 6 2000 Indiana Pacers 30.4 56 26 54 28 4.60 -0.45 4.15 108.5 103.6
## NRtg Pace FTr X3PAr TS. eFG. TOV. ORB. FT.FGA eFG..1 TOV..1 DRB.
## 1 9.1 93.3 0.346 0.153 0.525 0.484 12.7 30.6 0.241 0.443 13.4 73.1
## 2 7.1 89.9 0.316 0.175 0.546 0.501 14.5 30.3 0.240 0.461 13.8 72.4
## 3 6.4 90.8 0.346 0.138 0.535 0.488 14.3 27.8 0.258 0.451 13.5 73.0
## 4 5.6 94.0 0.286 0.184 0.532 0.491 15.2 29.3 0.217 0.454 15.7 70.5
## 5 5.0 89.6 0.337 0.134 0.540 0.490 14.3 29.5 0.260 0.477 15.0 73.2
## 6 4.9 93.1 0.302 0.224 0.552 0.503 13.3 24.9 0.245 0.469 12.6 71.5
## FT.FGA.1 Attend. G MP FG FGA FG. X3P X3PA X3P. X2P X2PA X2P. FT
## 1 0.222 771420 82 19805 3276 7288 0.450 534 1656 0.322 2742 5632 0.487 1521
## 2 0.217 835078 82 19830 3044 6635 0.459 439 1223 0.359 2605 5412 0.481 1956
## 3 0.188 884450 82 19730 3195 7047 0.453 519 1326 0.391 2676 5721 0.468 1407
## 4 0.245 773115 82 19730 3047 6640 0.459 583 1487 0.392 2464 5153 0.478 1629
## 5 0.256 801268 82 19855 3174 6827 0.465 394 1069 0.369 2780 5758 0.483 1558
## 6 0.197 752145 82 19805 3137 6836 0.459 344 1047 0.329 2793 5789 0.482 1649
## FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS WP year
## 1 2016 0.754 1056 2635 3691 1953 787 381 1325 1729 8607 0.8170732 (2000,2004]
## 2 2506 0.781 917 2458 3375 1707 665 273 1288 2011 8483 0.7195122 (2000,2004]
## 3 1751 0.804 931 2444 3375 1810 592 416 1124 1770 8316 0.6463415 (2000,2004]
## 4 2008 0.811 842 2612 3454 1857 559 422 1159 1786 8306 0.6463415 (2000,2004]
## 5 1982 0.786 1016 2373 3389 1852 671 381 1230 2020 8300 0.6707317 (2000,2004]
## 6 2368 0.696 1117 2738 3855 1921 613 534 1143 1841 8267 0.6829268 (2000,2004]
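The labels produced by `cut(yearID, breaks = 5)` can be misleading, so it may help to see which seasons land in each group. With seasons 2000 to 2019, `cut()` splits the range into five equal-width bins and rounds the break points for display; the bin labelled `(2000,2004]` actually holds 2000 to 2003, because the internal break sits at 2003.8. The sketch below is only an illustration: `seasons`, `season_labels` and `explicit_bins` are names introduced here, and the explicit breaks are one possible alternative, not what the analysis above uses.

```r
seasons <- 2000:2019

# Same call as above: five equal-width bins over the observed range
auto_bins <- cut(seasons, breaks = 5)
table(auto_bins)

# Which seasons fall in each automatically generated bin
split(seasons, auto_bins)

# Alternative with explicit breaks and readable labels (names are illustrative)
season_labels <- c("2000-2003", "2004-2007", "2008-2011", "2012-2015", "2016-2019")
explicit_bins <- cut(seasons,
                     breaks = c(1999, 2003, 2007, 2011, 2015, 2019),
                     labels = season_labels)

# Both versions assign every season to the same group
identical(as.integer(auto_bins), as.integer(explicit_bins))

# Winning percentage for the first printed row (67 wins, 15 losses)
67 / (67 + 15)
```

The last line reproduces the WP value shown for the first row of the output, about 0.817.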
data %>% ggplot(aes(x = Pace, y = WP, color = yearID)) +
geom_point() +
labs(title = "Winning percentage vs. Pace",
x = "Pace",
y = "Winning percentage") +
geom_smooth(method=lm)
Pace is an estimate of possessions per 48 minutes. A possession ends when a team gives up the ball and goes from offense to defense. This can happen in several ways: a player scores, a player misses a shot and the defense gets the rebound, or a player turns the ball over. As the colors in the graph show, pace increased from 2000 to 2019: teams played more and more possessions per 48 minutes. Excluding overtime, every game lasts 48 minutes, and this did not change between 2000 and 2019. Under NBA rules, the shot clock gives each offensive possession at most 24 seconds. More possessions in the same 48 minutes therefore means that teams are shooting earlier in each possession. In addition, since the 2018-19 season the shot clock resets to 14 seconds instead of 24 after an offensive rebound, so we expect pace to keep increasing. However, we cannot conclude any relationship between pace and winning percentage from the graph.
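The rise in pace is easier to read from a per-season average than from point colors. The following is a minimal sketch using base R so that it runs on its own; the `toy` data frame holds hypothetical values for illustration only and is not taken from nba_data.csv. With the real data, the same call could be applied to `data` directly.

```r
# Hypothetical values for illustration only; not taken from nba_data.csv
toy <- data.frame(
  yearID = c(2000, 2000, 2010, 2010, 2019, 2019),
  Pace   = c(91.0, 93.0, 92.0, 94.0, 99.0, 101.0)
)

# Mean pace per season; returns one row per yearID, sorted by season
pace_by_season <- aggregate(Pace ~ yearID, data = toy, FUN = mean)

print(pace_by_season)
```

Plotting `Pace` against `yearID` from such a summary would show the trend over time directly.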
data %>% ggplot(aes(x = ORtg, y = WP, color = yearID)) +
geom_point() +
labs(title = "Winning percentage vs. Offensive Rating",
x = "Offensive Rating",
y = "Winning percentage") +
geom_smooth(method=lm)
data %>% ggplot(aes(x = yearID, y = ORtg, color = yearID)) +
geom_point() +
labs(title = "Offensive Rating vs. Year",
x = "Year",
y = "Offensive Rating") +
geom_smooth(method=lm)
Offensive Rating is an estimate of points produced (players) or scored (teams) per 100 possessions. In the first graph, we can see a strong positive relationship between offensive rating and winning percentage. Throughout 2000 to 2019, a higher offensive rating goes with a higher winning percentage. If you want to win, you must be able to score points; this is true in any sport. In the second graph, we can see that, in general, offensive rating rose from 2000 to 2019. We believe this is due to the higher pace and the increase in three-point attempts.
p1 = ggplot(data = data, aes(x = as.character(yearID), y = X3PA)) + geom_boxplot()
p1 + ggtitle("3-Point Field Goal Attempt Over Time") + xlab("Year") + ylab("3-Point Field Goal Attempt")
From the above graph, we can see that 3-point field goal (3PFG) attempts have been increasing in recent years, which shows that NBA teams are now more inclined to shoot 3PFGs. In this section, we discuss why this trend has occurred.
data %>% ggplot(aes(x = X3PAr, y = WP, color = yearID)) +
geom_point() +
labs(title = "Winning percentage vs. Three points attempt rate",
x = "Three points attempt rate",
y = "Winning percentage") +
geom_smooth(method=lm)
As the above graph shows, even though NBA players have attempted more three-point shots in recent years, the distribution of winning percentage across teams has not changed much. In other words, the three-point attempt rate in the NBA has been increasing year over year, but it has no direct relationship with a team’s winning percentage. It is simply a trend in how NBA teams play. To analyze further what caused this trend, we need to look deeper into the data. For example, we can look at the relationship between three-point field goal percentage and winning percentage.
data %>% ggplot(aes(x = X3P. , y = WP, color = yearID)) +
geom_point() +
labs(title = "Winning percentage vs. Three points field goal percentage",
x = "Three points field goal percentage",
y = "Winning percentage") +
geom_smooth(method=lm)
According to the above graph, the regression line shows the relationship between three-point field goal percentage and winning percentage for each team. Though it is not very clear, winning percentage tends to be higher when three-point field goal percentage is higher, especially in recent years. In other words, a team with a very high three-point field goal percentage is more likely to win. This may be one factor behind NBA teams’ higher three-point attempt rate.
However, we cannot conclude that a higher X3P. (3-Point Field Goal Percentage) is the reason NBA teams now have a much higher average three-point attempt rate than before, because, as shown below, a team with a high X2P. (2-Point Field Goal Percentage) also has a high winning percentage.
data %>% ggplot(aes(x = X2P. , y = WP, color = yearID)) +
geom_point() +
labs(title = "Two points field goal percentage vs. Winning percentage",
x = "Two points field goal percentage",
y = "Winning percentage") +
geom_smooth(method=lm)
So, we now look deeper into the dataset to examine the relationship between field goal attempts (FGA) and field goal percentage (FGP).
data %>% ggplot(aes(x = X3PA , y = X3P., color = WP)) +
geom_point() +
labs(title = "Three points field goal percentage vs. Three points field goal attempts",
x = "Three points field goal attempts",
y = "Three points field goal percentage") +
geom_smooth(method=lm)
data %>% ggplot(aes(x = X2PA , y = X2P., color = WP)) +
geom_point() +
labs(title = "Two points field goal percentage vs. Two points field goal attempts",
x = "Two points field goal attempts",
y = "Two points field goal percentage") +
geom_smooth(method=lm)
Based on the above two graphs, each with a regression line showing the relationship between FGA and FGP, we can see that 3PFG (3-point field goal) attempts are positively related to 3PFG percentage, while 2PFG (2-point field goal) attempts are negatively related to 2PFG percentage. So, looking only at the data, more 2PFG attempts go with a lower 2PFG percentage, and based on the graph in 3.3.3, a lower 2PFG percentage goes with a lower winning percentage. The same idea applies to 3PFG: more 3PFG attempts go with a slightly higher 3PFG percentage, which, based on 3.3.2, goes with a higher winning percentage. This may be one reason why teams now choose to shoot from 3-point range.
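One way to make the trade-off between the two shot types concrete is expected points per attempt: three times the 3-point percentage versus two times the 2-point percentage. The sketch below applies this to the X3P. and X2P. values in the six rows printed by head() earlier; it is only an illustration on those six rows, not a result for the whole dataset, and the column names `pts_per_3PA`, `pts_per_2PA` and `three_is_better` are introduced here.

```r
# X3P. and X2P. copied from the six rows printed by head() earlier
shooting <- data.frame(
  X3P. = c(0.322, 0.359, 0.391, 0.392, 0.369, 0.329),
  X2P. = c(0.487, 0.481, 0.468, 0.478, 0.483, 0.482)
)

# Expected points per attempt = shot value * probability of making it
shooting$pts_per_3PA <- 3 * shooting$X3P.
shooting$pts_per_2PA <- 2 * shooting$X2P.

# TRUE where a 3-point attempt is worth more on average than a 2-point attempt
shooting$three_is_better <- shooting$pts_per_3PA > shooting$pts_per_2PA

print(shooting)
```

In five of these six rows a 3-point attempt yields more expected points than a 2-point attempt, even though the 3-point percentage is always lower.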
In fact, there is always more than one reason behind a trend. The higher average 3PFG attempt rate may also be driven by audiences wanting to see a 3-point game; that is, today’s audiences may be more inclined to watch NBA players decide games with 3-point shooting. A change in taste could lead to a change in the NBA’s style of play. So, we compare attendance with 3PFG attempts to see how the two relate to each other.
data %>% ggplot(aes(x = X3PA , y = Attend., color = yearID)) +
geom_point() +
labs(title = "Attendance vs. Three points field goal attempts",
x = "Three points field goal attempts",
y = "Attendance") +
geom_smooth(method=lm)
As the above graph shows, attendance increases with the number of three-point field goal attempts. This suggests that people are more willing to watch teams that shoot many three-pointers. Besides the change in audience taste and the goal of winning, the trend may also be related to changes in NBA rules and style. The NBA now encourages teams to play a fast-paced game, which may push the 3PFG attempt rate up. Comparing pace with 3PFG attempts, as shown below, gives results consistent with this hypothesis.
data %>% ggplot(aes(x = X3PA , y = Pace, color = yearID)) +
geom_point() +
labs(title = "Pace vs. Three points field goal attempts",
x = "Three points field goal attempts",
y = "Pace") +
geom_smooth(method=lm)
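When reading pooled scatterplots like the ones above, it helps to remember that two quantities that both rise over the seasons will show a positive pooled slope even if they are unrelated within any single season. The sketch below illustrates this with simulated data only (nothing here comes from nba_data.csv): `sim`, `trend`, the seed and all numeric settings are assumptions chosen for the demonstration. Adding `factor(yearID)` to the model is one common way to compare teams within the same season.

```r
set.seed(320)  # arbitrary seed so the simulation is reproducible

# Simulated league: 20 seasons x 30 teams. Both variables drift upward with
# the season, but within a season they are generated independently.
sim <- expand.grid(team = 1:30, yearID = 2000:2019)
trend <- sim$yearID - 2000
sim$X3PA <- 1000 + 80 * trend + rnorm(nrow(sim), sd = 100)
sim$Pace <- 90 + 0.5 * trend + rnorm(nrow(sim), sd = 1.5)

# Pooled fit, analogous to geom_smooth(method = lm) over all seasons
pooled_slope <- coef(lm(Pace ~ X3PA, data = sim))[["X3PA"]]

# Fit with a separate intercept per season, which removes the shared time trend
within_slope <- coef(lm(Pace ~ X3PA + factor(yearID), data = sim))[["X3PA"]]

print(c(pooled = pooled_slope, within_season = within_slope))
```

In this simulation the pooled slope is clearly positive while the within-season slope is close to zero, so a similar check on the real data would help separate a league-wide time trend from a team-level relationship.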
Based on what we have done so far, we can see that in recent years the three-point field goal attempt rate has been much higher than before. We tried to find the reasons behind this. Based on the dataset, we suggest that it may be caused by the change in game style, the change in audience taste, and the goal of winning.
In this section we discuss in more detail how the way NBA teams play has changed. We draw several graphs of an attribute vs. winning percentage for each of the five time periods. This lets us see more clearly how an attribute contributes to winning during a specific period.
The graph below shows Winning Percentage vs. Total Rebounds for each time period.
data %>%
ggplot(aes(x=TRB, y=WP)) +
geom_point(aes(color = year)) +
facet_wrap(~year) +
xlab("Total Rebounds") + ylab("Winning Percentage") +
ggtitle("Winning Percentage vs Total Rebounds") +
geom_smooth(method = 'lm') + labs(color = "Time period")
regression <- lm(WP~TRB*year, data = data)
model <- regression %>% broom::tidy()
model
## # A tibble: 10 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.729 0.333 -2.19 0.0291
## 2 TRB 0.000352 0.0000954 3.69 0.000243
## 3 year(2004,2008] 0.399 0.496 0.804 0.422
## 4 year(2008,2011] -0.418 0.490 -0.853 0.394
## 5 year(2011,2015] 1.06 0.357 2.98 0.00297
## 6 year(2015,2019] -0.0587 0.450 -0.130 0.896
## 7 TRB:year(2004,2008] -0.000109 0.000144 -0.754 0.451
## 8 TRB:year(2008,2011] 0.000130 0.000142 0.918 0.359
## 9 TRB:year(2011,2015] -0.000303 0.000103 -2.95 0.00335
## 10 TRB:year(2015,2019] 0.00000453 0.000127 0.0357 0.972
Based on the above graph and statistics, teams grabbed more rebounds in 2015-2019 than in the other four time periods, possibly because the pace was faster. The line is flatter in 2011-2015 than in the other four panels because the points are spread more widely along the horizontal axis, likely in part because this period includes the lockout-shortened 2011-12 season (66 games instead of 82), which lowers season totals. In general, more rebounds go with a higher winning percentage.
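The regression table can be read more directly by turning it into one slope per time period. In a model of the form `WP ~ TRB * year`, the `TRB` row is the slope in the reference period `(2000,2004]`, and each `TRB:year(...)` row is the difference from that slope. The sketch below retypes the printed estimates; `base_slope`, `interaction` and `period_slope` are names introduced here for illustration.

```r
# TRB coefficients copied from the regression table above
base_slope <- 0.000352  # TRB slope in the reference period (2000,2004]
interaction <- c(
  "(2000,2004]" = 0,
  "(2004,2008]" = -0.000109,
  "(2008,2011]" = 0.000130,
  "(2011,2015]" = -0.000303,
  "(2015,2019]" = 0.00000453
)

# Slope of WP on TRB within each period = base slope + interaction term
period_slope <- base_slope + interaction

# Expressed as change in winning percentage per 100 extra rebounds
print(round(100 * period_slope, 4))
```

The `(2011,2015]` slope comes out at roughly 0.00005, about one seventh of the reference slope, which matches the visibly flatter line in that panel.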
data %>%
ggplot(aes(x=AST, y=WP, color = year)) +
geom_point() +
xlab("Assists") + ylab("Winning Percentage") +
ggtitle("Winning Percentage vs Assists") +
geom_smooth(method = 'lm') + labs(color = "Time period")
regression <- lm(WP~AST*year, data = data)
model <- regression %>% broom::tidy()
model
## # A tibble: 10 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.357 0.153 -2.33 0.0202
## 2 AST 0.000478 0.0000852 5.61 0.0000000314
## 3 year(2004,2008] 0.234 0.213 1.10 0.273
## 4 year(2008,2011] -0.0848 0.225 -0.377 0.707
## 5 year(2011,2015] 0.588 0.181 3.24 0.00126
## 6 year(2015,2019] 0.0744 0.201 0.371 0.711
## 7 AST:year(2004,2008] -0.000118 0.000121 -0.977 0.329
## 8 AST:year(2008,2011] 0.0000600 0.000127 0.473 0.636
## 9 AST:year(2011,2015] -0.000319 0.000102 -3.13 0.00185
## 10 AST:year(2015,2019] -0.0000659 0.000109 -0.605 0.545
From the above graph and statistics, we can conclude that more assists go with a higher winning percentage; that is, whether a team wins depends to some degree on its number of assists. In 2015-2019, the number of assists was higher than in the other four time periods. This may be because there were more offensive possessions, which create more opportunities for assists.
data %>%
ggplot(aes(x=X3P., y=WP)) +
geom_point(aes(color = year)) +
facet_wrap(~year) +
xlab("3PFG Percentage") + ylab("Winning Percentage") +
ggtitle("Winning Percentage vs 3PFG Percentage") +
geom_smooth(method = 'lm') + labs(color = "Time period")
regression <- lm(WP~X3P.*year, data = data)
model <- regression %>% broom::tidy()
model
## # A tibble: 10 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.391 0.197 -1.99 0.0472
## 2 X3P. 2.54 0.560 4.54 0.00000689
## 3 year(2004,2008] 0.0406 0.315 0.129 0.897
## 4 year(2008,2011] -0.610 0.306 -1.99 0.0468
## 5 year(2011,2015] -0.261 0.297 -0.881 0.379
## 6 year(2015,2019] -0.483 0.345 -1.40 0.162
## 7 X3P.:year(2004,2008] -0.132 0.893 -0.148 0.882
## 8 X3P.:year(2008,2011] 1.65 0.860 1.91 0.0563
## 9 X3P.:year(2011,2015] 0.723 0.841 0.859 0.391
## 10 X3P.:year(2015,2019] 1.31 0.971 1.35 0.178
As the above graphs show, the cluster of points and the regression line move to the right over time, which means the 3-point field goal percentage is improving, alongside the increase in 3-point attempts seen earlier. Also, 3-point field goal percentage has a positive relationship with winning percentage: teams with a higher 3-point field goal percentage tend to win more.
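To give the X3P. coefficients a practical scale, the fitted slopes can be converted into wins per season. The sketch below retypes the printed estimates and asks what a rise of 0.01 in X3P. (one percentage point) corresponds to over an 82-game season under the fitted model. This is a rough reading of the table rather than a causal estimate, and most of the interaction terms are not statistically significant at the 0.05 level, so differences between periods should be treated with caution. The names `delta_wp` and `extra_wins` are introduced here for illustration.

```r
# X3P. coefficients copied from the regression table above
base_slope <- 2.54  # X3P. slope in the reference period (2000,2004]
interaction <- c(
  "(2000,2004]" = 0,
  "(2004,2008]" = -0.132,
  "(2008,2011]" = 1.65,
  "(2011,2015]" = 0.723,
  "(2015,2019]" = 1.31
)
period_slope <- base_slope + interaction

# Change in winning percentage for a 0.01 (one percentage point) rise in X3P.
delta_wp <- 0.01 * period_slope

# Same change expressed in wins over an 82-game season
extra_wins <- 82 * delta_wp
print(round(extra_wins, 1))
```

In the reference period this works out to about two additional wins per season for each percentage point of 3-point accuracy.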